Automatic annotation of the bHLH gene family in plants

Background: The bHLH transcription factor family is named after the basic helix-loop-helix (bHLH) domain that is characteristic of its members. Understanding the function and characteristics of this family is important for examining the wide range of functions it regulates. As the availability of genome sequences and transcriptome assemblies has increased significantly, the need for automated solutions that provide reliable functional annotations has grown.

Results: A phylogenetic approach was adapted for the automatic identification and functional annotation of the bHLH transcription factor family. The bHLH_annotator, designed for the automated functional annotation of bHLHs, was implemented in Python3. Sequences of bHLHs described in the literature were collected to represent the full diversity of bHLH sequences. Previously described orthologs form the basis for assigning functional annotations to candidates, which are also screened for bHLH-specific motifs. The pipeline was successfully deployed on the two Arabidopsis thaliana accessions Col-0 and Nd-1, the monocot species Dioscorea dumetorum, and a transcriptome assembly of Croton tiglium. Depending on the search parameters applied to the initial candidates in the pipeline, species-specific candidates or members of the bHLH family that experienced domain loss can be identified.

Conclusions: The bHLH_annotator allows a detailed and systematic investigation of the bHLH family in land plant species and classifies candidates based on bHLH-specific characteristics, which distinguishes the pipeline from other established functional annotation tools. This provides the basis for the functional annotation of the bHLH family in land plants and the systematic examination of a wide range of functions regulated by this transcription factor family.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-023-09877-2.


Identification of outgroup sequences
In order to distinguish between bHLH-like candidates and bona fide bHLHs, a collection of phylogenetically close outgroup sequences is needed. Possible outgroup sequences in A. thaliana were identified by performing a BLASTp v2.12.0+ [1] search against all predicted polypeptide sequences of TAIR10 using the A. thaliana bHLH sequences [2][3][4] as baits. BLAST hits with a bit score above 50 were considered as bHLH-like candidates. Together with the A. thaliana bHLH sequences, the candidates were globally aligned using Muscle5 v5.1.osx64 [5].
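The bit score filter described above can be sketched as follows. This is a minimal illustration, not the code of the bHLH_annotator: it assumes the BLASTp results are available in the default 12-column tabular layout (bit score in the last column), and the function name and all sequence identifiers in the example are hypothetical.

```python
def filter_blast_hits(tabular_text, min_bit_score=50.0):
    """Return the subject IDs of all hits with a bit score above the cutoff.

    Assumes 12-column tabular BLAST output:
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    candidates = set()
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]
        bit_score = float(fields[11])
        if bit_score > min_bit_score:
            candidates.add(subject_id)
    return candidates


# Hypothetical hits: two above the cutoff, one below.
example = "\n".join([
    "AtbHLH001\tAT1G00001.1\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-120\t410",
    "AtbHLH001\tAT2G00002.1\t35.0\t60\t39\t0\t5\t64\t10\t69\t2e-05\t48.1",
    "AtbHLH002\tAT3G00003.1\t41.2\t85\t50\t0\t1\t85\t20\t104\t3e-12\t66.2",
])
print(sorted(filter_blast_hits(example)))  # ['AT1G00001.1', 'AT3G00003.1']
```

Collecting subject IDs in a set means that a sequence hit by several baits is reported only once.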
The alignment was trimmed by removing alignment positions with less than 10% occupancy. A maximum likelihood tree was constructed via FastTree v2.1.10 [6] using the "-wag" option. For inspection, the phylogenetic tree was visualised using iTOL [7].
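The occupancy-based trimming can be illustrated with a short sketch. It assumes that the alignment is held as a dictionary of equally long strings, that "-" is the only gap symbol, and that occupancy is the fraction of sequences with a residue in a column; the function name is hypothetical and the toy alignment is invented for illustration.

```python
def trim_alignment(alignment, min_occupancy=0.1):
    """Drop alignment columns in which fewer than min_occupancy of the
    sequences carry a residue (i.e. anything other than the gap symbol '-')."""
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")
    kept_columns = []
    for col in range(length):
        occupied = sum(1 for seq in seqs if seq[col] != "-")
        if occupied / len(seqs) >= min_occupancy:
            kept_columns.append(col)
    return {
        name: "".join(seq[col] for col in kept_columns)
        for name, seq in zip(names, seqs)
    }


# Toy alignment; a 50% cutoff is used here only because four sequences
# are too few for a 10% cutoff to remove anything but all-gap columns.
toy = {"a": "MK-A-", "b": "M--A-", "c": "MR-AT", "d": "M--A-"}
print(trim_alignment(toy, min_occupancy=0.5))
# {'a': 'MKA', 'b': 'M-A', 'c': 'MRA', 'd': 'M-A'}
```

With the default cutoff of 0.1, only the all-gap third column of the toy alignment would be removed.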
Candidates forming a monophyletic group distinct from the bona fide bHLHs were identified as non-bHLH outgroup sequences. Outgroup sequences were then identified in all species listed in Table 1. For each species, the A. thaliana outgroup sequences were included in the BLASTp search in addition to the bHLH sequences of the respective species. BLAST hits were classified as outgroup sequences if they showed a close phylogenetic relationship to the A. thaliana outgroup sequences, but not to the bHLH sequences of the respective species.

Optimisation of sequence collections through thinning
Large and redundant sequence collections lead to high computational costs and long run times in the following analyses. It is possible to optimise these collections by reducing large groups of very similar sequences to only one representative sequence. The initial bait collection and the initial outgroup collection were separately optimised by thinning based on the phylogenetic distance of the individual sequences to generate a small set of sequences that still represents the full phylogenetic diversity of bHLHs. Phylogenetic trees of the collections were constructed with FastTree 2.1.10 [6] as described above. DendroPy 4.5.2 [8] was deployed to calculate the mean nearest taxon distance and the patristic distances between all leaves of the trees. For each leaf, neighbouring leaves with a patristic distance less than the mean nearest taxon distance multiplied by a given factor were identified as closely related group members. Leaves identified as group members were excluded from further group member identifications to prevent overlapping groups. The leaf with the longest sequence was chosen as the representative of the group and added to the optimised collection. Leaves with no phylogenetic neighbour within the mean nearest taxon distance multiplied by ten were identified as singular sequences on extraordinarily long branches and excluded from the optimised collection.
The collection optimisation was performed iteratively. Each step included the construction of a phylogenetic tree followed by thinning. First, sequences were thinned per species. Leaves with a patristic distance less than twice the mean nearest taxon distance were considered as a group. Only the longest sequence was retained to represent this group of paralogous sequences in the following steps. The representative sequences obtained from all species were merged and three rounds of thinning were performed. In these steps, the factor for the mean nearest taxon distance was set to one. The representative sequences from the last step formed the optimised collection, a diverse set of sequences with low lineage redundancy that still allows the identification of lineage-specific bHLH sequences. The HMMER 3.3.2 [9] program "hmmbuild" was used to create an HMM motif of the optimised bait collection.
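The iteration schedule described above (per-species thinning with factor two, followed by three rounds on the merged set with factor one) can be summarised in a small driver. This is a sketch only: `thin_step` is a hypothetical callable that is assumed to build the tree and perform one thinning step, and the toy stand-in used below simply keeps one sequence per invented family prefix and ignores the factor.

```python
def optimise_collection(per_species, thin_step, merged_rounds=3):
    """Apply the thinning schedule: once per species (factor 2), then
    repeatedly on the merged representatives (factor 1)."""
    representatives = []
    for sequences in per_species.values():
        representatives.extend(thin_step(sequences, factor=2.0))
    for _ in range(merged_rounds):
        representatives = thin_step(representatives, factor=1.0)
    return representatives


calls = []  # records (number of input sequences, factor) for each step


def toy_thin_step(sequences, factor):
    """Stand-in for tree construction plus thinning (NOT the real method):
    keeps the first sequence seen for each family prefix before '_'."""
    calls.append((len(sequences), factor))
    first_per_family = {}
    for name in sequences:
        first_per_family.setdefault(name.split("_")[0], name)
    return list(first_per_family.values())


per_species = {
    "speciesA": ["fam1_a1", "fam1_a2", "fam2_a1"],
    "speciesB": ["fam1_b1", "fam3_b1"],
}
print(optimise_collection(per_species, toy_thin_step))
# ['fam1_a1', 'fam2_a1', 'fam3_b1']
print(calls)
# [(3, 2.0), (2, 2.0), (4, 1.0), (3, 1.0), (3, 1.0)]
```

Passing the thinning step in as a callable keeps the schedule separate from the tree construction, which in the described workflow is repeated before every thinning round.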

Table 1: Plant polypeptide sequence datasets used to collect bHLH sequences. The species, version and database source are given.